# Social Network Clustering Analysis

## Overview

This repository contains three R scripts for analyzing social network data using Random Forest-based clustering methods:

1. `1_Data_Cleaning.R` - Data preprocessing and feature engineering
2. `2_Cluster_Algorithm.R` - Full clustering analysis (Optional)
3. `3_Cluster_Algorithm_Simple.R` - Simplified analysis (main version for manuscript)

## Requirements

**R Version:** 4.0.0 or higher

**Required Packages:**
```r
install.packages(c("plyr", "writexl", "readxl", "randomForest", "cluster", 
                   "anocva", "ClustOfVar", "psych", "tibble", "clustree", 
                   "fmsb", "haven", "readr", "patchwork"))
```
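Before running the scripts, it can help to check which of these packages are already installed. The snippet below is a minimal sketch using base R only; the variable names (`required_pkgs`, `missing_pkgs`) are illustrative and do not appear in the scripts. It reports missing packages instead of installing them.

```r
# Packages listed in the install.packages() call above
required_pkgs <- c("plyr", "writexl", "readxl", "randomForest", "cluster",
                   "anocva", "ClustOfVar", "psych", "tibble", "clustree",
                   "fmsb", "haven", "readr", "patchwork")

# TRUE for each package whose namespace can be loaded (i.e. it is installed)
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

Passing `missing_pkgs` to `install.packages()` installs only what is absent.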

## Data Requirements

### Input Files

Place these files in your working directory:

- `Data_wave1_210411.rdata` - Wave 1 survey data
- `Data_wave2_210601.rdata` - Wave 2 survey data

If your files have different names, edit the configuration section in `1_Data_Cleaning.R`:

```r
WAVE1_RDATA <- "your_wave1_file.rdata"
WAVE2_RDATA <- "your_wave2_file.rdata"
```

## Usage

### Step 1: Data Cleaning

Run the data cleaning script first:

```r
source("1_Data_Cleaning.R")
```

This script loads the two waves, recodes variables, calculates network metrics (kin/non-kin composition, support patterns, homophily, tie strength), and outputs:

- `Second&firstwave.xlsx` - Merged raw data
- `covnetps_forRF.xlsx` - Cleaned variables for clustering

### Step 2: Full Clustering Analysis (Optional)

Run the comprehensive analysis:

```r
source("2_Cluster_Algorithm.R")
```

This script performs:

1. Unsupervised Random Forest on 43 variables (Wave 1 only)
2. Agglomerative clustering with 2-50 cluster solutions
3. Supervised Random Forest validation for each solution
4. Error analysis and optimal cluster selection
5. Variable clustering and PCA composite creation
6. Second-stage clustering on 10 composites
7. 8-cluster solution analysis with radar charts
8. Wave 2 cluster prediction

**Outputs** (in `Result/full_version/`):
- Variable importance plots
- Prediction accuracy charts
- Radar charts showing cluster profiles
- Z-scored cluster mean tables
- Final datasets (.Rdata, .csv, .dta)
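
The first two steps of this pipeline (unsupervised Random Forest, then agglomerative clustering) can be sketched on toy data as follows. This is an illustration under assumptions, not the code in `2_Cluster_Algorithm.R`: the data are synthetic, the variable names are hypothetical, `ntree` is kept small, the dissimilarity is taken as `1 - proximity`, and base R's `hclust()` with Ward linkage stands in for whichever agglomerative routine the script uses.

```r
library(randomForest)

set.seed(42)  # the synthetic data and the forest are both random

# Toy stand-in for the Wave 1 network variables (hypothetical data)
toy <- data.frame(
  kin_support    = c(rnorm(30, 0), rnorm(30, 3)),
  nonkin_support = c(rnorm(30, 3), rnorm(30, 0)),
  tie_strength   = rnorm(60)
)

# Calling randomForest() without a response runs it in unsupervised mode;
# proximity = TRUE returns the n x n proximity matrix
urf <- randomForest(x = toy, ntree = 500, proximity = TRUE)

# Convert proximity (a similarity) to a dissimilarity and cluster it
rf_dist <- as.dist(1 - urf$proximity)
tree <- hclust(rf_dist, method = "ward.D2")

# One column of cluster labels per candidate number of clusters
solutions <- cutree(tree, k = 2:5)
head(solutions)
```

Each column of `solutions` can then be used as the response in a supervised Random Forest to compare solutions by prediction error.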

### Step 3: Simplified Analysis (Main Version)

Run the streamlined version:

```r
source("3_Cluster_Algorithm_Simple.R")
```

This script implements an alternative approach:

1. Variable clustering stability analysis
2. Direct creation of 10 PCA composite variables
3. Unsupervised Random Forest on the composites
4. Spectral and agglomerative clustering (k = 1-20)
5. Clustree visualization comparing the two methods
6. Supervised validation and selection of the 8-cluster solution
7. Radar charts and Wave 2 cluster prediction

**10 Composite Variables:**
- Kin support, kin involvement
- Non-kin support, non-kin distance, non-kin tie strength, non-kin online
- Homophily, school-based, work-based
- Outdoor activities
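
One common way to build a PCA composite is to take the first principal component of a group of related variables. The sketch below illustrates that idea with base R's `prcomp()`; the item names and data are hypothetical, and the script itself may construct its composites differently.

```r
set.seed(7)

# Hypothetical group of three correlated items that share one theme
latent <- rnorm(100)
items <- data.frame(
  kin_emotional = latent + rnorm(100, sd = 0.5),
  kin_practical = latent + rnorm(100, sd = 0.5),
  kin_financial = latent + rnorm(100, sd = 0.5)
)

# Standardize the items and extract principal components
pca <- prcomp(items, center = TRUE, scale. = TRUE)

# Scores on the first component serve as the composite
kin_support <- pca$x[, 1]

# Share of the group's variance captured by the composite
var_explained <- pca$sdev[1]^2 / sum(pca$sdev^2)
print(round(var_explained, 2))
```

The sign of a principal component is arbitrary, so check the loadings in `pca$rotation` and flip the composite if higher scores should mean more of the trait.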

**Outputs** (in `Result/simplified_version/`):
- Variable clustering plots
- Clustree comparison
- Radar charts
- Z-scored profiles
- Final datasets

## Output Files

**Data files:**
- `Second&firstwave.xlsx` - Complete merged raw data
- `covnetps_forRF.xlsx` - Analytical dataset
- `Mean_zscore_table*.xlsx` - Cluster profiles (z-scores)
- `cov_netps_nettype_0612.*` - Final data with clusters (.Rdata, .csv, .dta)

**Visualizations:**
- Variable importance plots (PDF)
- Prediction accuracy plots (PDF)
- Clustree flow diagrams (PNG)
- Radar charts (PNG)
- Proximity density plots (PDF)

## Interpreting Results

**Radar Charts:** Each cluster is shown as a polygon with spokes representing the 10 composite dimensions. Distance from center indicates z-score (positive = above average, negative = below average).

**Z-Score Tables:** Standardized cluster means showing how each cluster differs from the sample average. Values > 1.0 indicate substantially above average, < -1.0 substantially below average.
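
As a rough illustration of how such a table can be computed, the sketch below standardizes each variable across the whole sample and then averages the z-scores within each cluster. The data and names are hypothetical, and the scripts may standardize in a different order.

```r
# Hypothetical composites and cluster assignments
dat <- data.frame(
  kin_support = c(1, 2, 3, 10, 11, 12),
  homophily   = c(5, 6, 7, 5, 6, 7)
)
cluster <- c(1, 1, 1, 2, 2, 2)

# Standardize each variable across the full sample (mean 0, SD 1)
z <- as.data.frame(scale(dat))

# Average the z-scores within each cluster
profile <- aggregate(z, by = list(cluster = cluster), FUN = mean)
print(profile)
```

In this toy example cluster 1 falls below the sample average on `kin_support` and cluster 2 above it, while both clusters sit at the average on `homophily`.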

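**Prediction Error:** The per-cluster prediction error from the supervised Random Forest validation. Clusters with error < 0.20 are treated as high quality and stable. Both clustering scripts (`2_Cluster_Algorithm.R` and `3_Cluster_Algorithm_Simple.R`) filter to retain only reliable clusters.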
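
A per-cluster error of this kind can be read from a supervised Random Forest fitted with the cluster labels as the response. The following is a minimal sketch on synthetic data, assuming the error in question is the out-of-bag class error reported by `randomForest`; the data, labels, and `ntree` value are illustrative only.

```r
library(randomForest)

set.seed(1)

# Hypothetical composites and cluster labels standing in for the real data
toy <- data.frame(a = c(rnorm(40, 0), rnorm(40, 4), rnorm(40, 8)),
                  b = c(rnorm(40, 0), rnorm(40, 4), rnorm(40, 0)))
cluster <- factor(rep(c("c1", "c2", "c3"), each = 40))

# Supervised forest: how well can the cluster labels be predicted?
srf <- randomForest(x = toy, y = cluster, ntree = 500)

# Out-of-bag misclassification rate for each cluster
class_error <- srf$confusion[, "class.error"]

# Clusters that meet the 0.20 threshold described above
reliable <- names(class_error)[class_error < 0.20]
print(round(class_error, 3))
print(reliable)
```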

## Troubleshooting

### File not found
Check your working directory with `getwd()` and ensure data files are present. Update file paths in the script configuration if needed.

### Memory errors
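Reduce the number of Random Forest trees from 10,000 to 5,000, or to 1,000 for testing. Close other applications to free memory. On Windows with R versions before 4.2.0, `memory.limit(size = 16000)` raises the allocation limit; from R 4.2.0 onward the function is no longer supported and has no effect.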

### Package installation failures
Ensure your R version is 4.0.0 or higher. Try `install.packages("package_name", type = "source")` if standard installation fails. Some packages may require compilation tools (Rtools on Windows, Xcode Command Line Tools on macOS).

### RData object not found
Load your file manually with `load("file.rdata")` then `ls()` to see object names. Add the correct name to `WAVE1_OBJECT_CANDIDATES` or `WAVE2_OBJECT_CANDIDATES` in the script.
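
To inspect an `.rdata` file without overwriting anything in your workspace, you can load it into a separate environment. The sketch below creates a small example file so that it runs on its own; the object name `survey_wave1` and the file path are hypothetical.

```r
# Create a small example file so the sketch is self-contained
example_path <- file.path(tempdir(), "example_wave.rdata")
survey_wave1 <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.0))
save(survey_wave1, file = example_path)
rm(survey_wave1)

# Load into a separate environment; load() returns the object names
env <- new.env()
loaded_names <- load(example_path, envir = env)
print(loaded_names)

# Keep only the objects that are data frames
is_df <- vapply(loaded_names,
                function(nm) is.data.frame(get(nm, envir = env)),
                logical(1))
df_names <- loaded_names[is_df]

# Retrieve the first data frame found
wave1 <- get(df_names[1], envir = env)
```

The names printed by `loaded_names` are the ones to add to the candidate lists in the script.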
